Model Evaluation (mostly for regression, but with some more general points)

This notebook is going to cover a bunch of things related to model evaluation for regression, with a specific focus on linear and polynomial regression:

  1. Coefficient interpretation - one part of evaluation is looking at what our model actually learned. We're going to touch on this here, but only for linear regression.
  2. Making predictions - when we have a model we want to evaluate, we have to actually know what predictions it is making!
  3. Identifying some models to compare to - it is fun to think about how well a single model does, but often it is hard to judge a model solely on its own performance (we'll discuss why below).
  4. Evaluating models when we know the true data generating process - in the real world, we don't. But simulating cases where we do can teach us a lot! We can then look at the true bias and variance of the model!
  5. Evaluating models when we don't know the true data generating process - here, we have to rely on empirical estimates of error, but we have to be really careful when we do that. We'll introduce some main ideas here, and continue this in future lectures.

Interpreting coefficients for linear regression

One important thing we might want to know from a given linear regression model is how any given feature relates to our prediction. We can do this by analyzing the regression coefficients.

The key thing to remember is that a linear regression coefficient tells us how much our modeled outcome changes with a one-unit change in $x$.

If we have multiple predictors, i.e. ${\bf x}$, then we append to the statement above: holding all other predictors fixed.
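For concreteness, here is a minimal sketch on made-up data (hypothetical features, not anything from class) of reading these coefficients off a fitted sklearn model:

```python
# A minimal sketch on made-up data (not the in-class example): fit a linear
# regression and read off the coefficients.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 2))                      # two hypothetical predictors
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
print(model.intercept_)   # the constant term
print(model.coef_)        # each entry: change in predicted y for a one-unit
                          # change in that feature, holding the other fixed
```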

Some considerations in interpreting coefficients

Correlation is not causation

Regression models do not care what the $x$ and $y$ are. The fact that $x$ predicts $y$ is therefore not a causal statement; it would be equally valid to flip the model around and use $y$ to predict $x$. There are at least two ways to get to causation: 1) to use social theory, and 2) to use causal inference. Both come with many assumptions; we will discuss this later in the semester.

Knowing a relationship is "real"

As we discuss elsewhere in class, our model learns parameters based on training data that is simply a sample from a population. Thus, one reasonable question to ask is: how do I know if the relationship between my $x$ and my $y$, as measured by the regression coefficients/parameters, is "real"? There are many ways to answer this question. Perhaps the most obvious and well-established is to place a confidence interval on your parameter estimate using statistical inference. As always, StatQuest has a nice explanation of this. We will discuss this in more detail at some point later in class. But another approach to thinking about this for ML, where we often care mainly about prediction, is that this relationship is "real" if it is useful in making predictions. We're going to stick with that one for now.

Interpretation is based on variable scale

Your interpretation of your features depends on the scale of both $x$ and $y$. Sometimes, to make interpretation easier, you may want to rescale your input, or your outputs. Some points to remember:

Assuming you have an intercept in your model, and no interaction terms, rescaling changes your interpretation of a linear model, but not the actual fit. Here is a useful explanation of why.
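Here is a minimal sketch with made-up data (a stand-in for the cases/deaths example referenced below, not the in-class dataset) showing that rescaling changes the coefficients, and hence their interpretation, but not the fitted predictions:

```python
# Made-up data: rescaling features changes the coefficients (interpretation)
# but not the fitted predictions.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
cases = rng.uniform(0, 100000, size=200)     # hypothetical feature on a large scale
deaths = rng.uniform(0, 1000, size=200)      # hypothetical feature on a small scale
y = 1.0 + 0.0001 * cases + 0.01 * deaths + rng.normal(scale=0.5, size=200)

X_raw = np.column_stack([cases, deaths])
X_scaled = np.column_stack([cases / 1000, deaths / 1000])   # both now "per 1000"

m_raw = LinearRegression().fit(X_raw, y)
m_scaled = LinearRegression().fit(X_scaled, y)

print(m_raw.coef_, m_scaled.coef_)   # coefficients differ by the scale factor
print(np.allclose(m_raw.predict(X_raw), m_scaled.predict(X_scaled)))  # True: same fit
```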

Exercise: In this example, what is more important, deaths or cases? Do our regression coefficients tell us that?

Changing the scale of your $y$ changes your interpretation as well! We discussed an example in class, and there is one on your programming assignment. There are some useful resources on the internet for this point if you want to learn more; in particular, here is one useful exploration for how to interpret linear coefficients with a logged outcome.
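For reference, here is the standard way to read a coefficient when the outcome is logged (a general result, not specific to our example). If the fitted model is

$ \log(y) = w_0 + w_1 \cdot x $, then $ y = e^{w_0} \cdot e^{w_1 x} $,

so a one-unit increase in $x$ multiplies the predicted $y$ by $e^{w_1}$ (for small $w_1$, roughly a $100 \cdot w_1$ percent change).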

Finally, we will see models where rescaling/standardizing your variables is critical for model performance. Linear regression is not one of them, but other regression and classification approaches definitely are impacted by the scale of your input variables. More on this later!

Interpretation is based on what your null model represents!

The most obvious place this comes up is in the context of using categorical variables in your regression.

Consider a regression model where we are trying to predict the amount of water you drink in a day. We have one feature, whether or not you have exercised:

OK! Let's try to fit that!

Exercise: What happened?

... OK, let's fix the issue, naively

Hmph... that worked, but... it doesn't match our true function! Note, however, that $13 - 10 = 3$, which is what we would have expected for a single coefficient.

Exercise: what happened? (don't peek!)

... note that our regression function as specified is:

$ y = w_0 + w_1 \cdot exercised + w_2 \cdot no\_exercise + \epsilon$

Now, note that: $exercised = 1 - no\_exercise$, so we can rewrite that as:

$ y = w_0 + w_1 \cdot exercised + w_2 - w_2 \cdot exercised + \epsilon $, and so:

$ y = (w_0 + w_2) + (w_1 - w_2) \cdot exercised + \epsilon$

Exercise: How many $w_1$ and $w_2$s can we find that solve this equation?

Note: the closed-form solution wouldn't even have worked here, because the matrix we need to invert ($X^TX$) is singular when one column is a linear combination of the others. But sklearn still gives us an output! Here is an explanation of why; you're not responsible for understanding that explanation, though.


OK, so now we know that we should pick either yes or no and drop that column. Here, it makes sense to drop no, although in many cases this choice is arbitrary. Having done so, our model becomes:

$ y = w_0 + w_1 \cdot exercised + \epsilon$

Exercise: What does $w_0$ now represent?
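Here is a minimal sketch of fitting this single-dummy specification, assuming (as the $13 - 10 = 3$ note above suggests) a true function like $y = 10 + 3 \cdot exercised$ plus noise; the data are simulated, not the in-class data:

```python
# Simulated data under the assumed true function y = 10 + 3 * exercised + noise.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
exercised = rng.integers(0, 2, size=200)                   # single 0/1 feature
y = 10 + 3 * exercised + rng.normal(scale=0.5, size=200)   # simulated water intake

model = LinearRegression().fit(exercised.reshape(-1, 1), y)
print(model.intercept_)   # estimate of w_0 (the no-exercise group's average)
print(model.coef_)        # estimate of w_1 (the difference exercising makes)
```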


Finally, note that all of this logic extends to categorical variables with more than two categories, which we "one-hot encode" in the same way!
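As a minimal sketch (the column name and categories here are made up), this is what one-hot encoding with a dropped level looks like in pandas:

```python
# One-hot encode a categorical feature, dropping one level to avoid the
# redundancy problem above.
import pandas as pd

df = pd.DataFrame({"activity": ["run", "swim", "none", "run", "none"]})

# drop_first=True drops one category; that category becomes the baseline
# captured by the intercept.
dummies = pd.get_dummies(df["activity"], prefix="activity", drop_first=True)
print(dummies)
```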

Aside: not all transformations are linear rescaling. Exercise: What is an example of a transformation that might help us make predictions?

Making predictions with a model

To evaluate our model's predictions, we have to be able to actually make those predictions. At a high level, we make predictions by simply subbing in the $x$ value in the test data into our trained model. That is, assume we have trained a model $f_{\hat{w}}$ (using the notation in the UW course). Then to make a prediction on a test point, we evaluate $f_{\hat{w}}$ at $x$, i.e. we compute $f_{\hat{w}}(x)$.
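A minimal sketch with made-up data (the variable names here are just for illustration):

```python
# Making predictions is just evaluating the trained model f_w-hat at the
# test inputs.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X_train = rng.uniform(size=(100, 1))
y_train = 1.0 + 2.0 * X_train.ravel() + rng.normal(scale=0.1, size=100)

f_hat = LinearRegression().fit(X_train, y_train)   # the trained model f_w-hat

X_test = np.array([[0.25], [0.5], [0.75]])
print(f_hat.predict(X_test))   # f_w-hat(x) evaluated at each test point
```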

Finding some models to compare

It can in many cases be useful to evaluate a single model, but typically it is very hard to know how well you are making predictions if you look at one and only one model combined with one and only one set of features. Exercise: why?

In class, we generalized regression, following the end of Chapter 1 from Hunter Schafer's book. You are responsible for understanding this generalized equation.

At a high level, though, there are a few different things we can do to try to make different/better predictions:

  1. Change our features - we can try to collect more features, or manipulate/transform the ones we have
  2. Change our model - increase or decrease the complexity of our model
  3. Change our optimization procedure - if we have a complex surface for our loss function, doing better optimization can lead us to better parameter estimates (and thus better predictions)

Note that, as we discussed in class, 1 and 2 are tightly intertwined, in that some models (e.g. polynomial regression) implicitly "add" to our features.
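A minimal sketch of that point, on made-up data: polynomial regression is just linear regression on an expanded feature set.

```python
# Polynomial regression = feature expansion + ordinary linear regression,
# illustrating how points 1 and 2 above are intertwined.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=(100, 1))
y = np.sin(3 * x).ravel() + rng.normal(scale=0.1, size=100)

# The "added" features: 1, x, x^2, x^3 for the first two rows
print(PolynomialFeatures(degree=3).fit_transform(x[:2]))

# A degree-3 polynomial regression fit with plain linear regression underneath
poly_model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
poly_model.fit(x, y)
```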

Actually comparing models ... when we know the truth

!!!!!!!! Disclaimer: this is a modified version of Hunter Schafer's demo [here](https://courses.cs.washington.edu/courses/cse416/18sp/notebooks/html/bias-variance.html) !!!!!

We're going to do this "actual model comparison" in two ways. In the first, we're going to assume we know the true function. If we know the true function, we can assess the bias/variance tradeoff of our model by direct evaluation of model bias and variance!

In this section, we are going to:

  1. Define a "true function" we are going to try to learn
  2. Define functions to simulate random training datasets
  3. Learn model(s) based on (a set of) training datasets
  4. Make a bunch of predictions at various values of x
  5. Run some experiments to see the bias/variance tradeoff in action

Reminder: the high-level goal

One way to think about ML is, as we have discussed, in terms of trying to estimate (conditional probability) functions. A good way to think about this is to think about there being some true function $f$ that we want to learn.

Sadly, we don't know the functional form (linear? polynomial?) or the parameters of the model (in terms of what we have seen so far, what are the $w$s?).

Our goal in ML is to learn $f$ given only one random training set that gives us some $x$ and some $y$, where $y$ is, as noted further below, the output of $f(x)$ plus some irreducible noise.

Exercises

Where does error come from?

In lecture, we discussed how the source of error in our estimation of $f(x)$ using training data comes from three places:

  1. Irreducible noise in the data itself
  2. The bias of our model
  3. The variance of our model
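For reference, a standard way to write this decomposition of expected squared prediction error at a point $x$ (where $y = f(x) + \epsilon$ with noise variance $\sigma^2$, and $f_{\hat{w}}$ is a model fit on a random training set, using the notation introduced below) is:

$ \mathbb{E}\big[(y - f_{\hat{w}}(x))^2\big] = \underbrace{\big(f(x) - \mathbb{E}_{\hat{w}}[f_{\hat{w}}(x)]\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}_{\hat{w}}\big[\big(f_{\hat{w}}(x) - \mathbb{E}_{\hat{w}}[f_{\hat{w}}(x)]\big)^2\big]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible noise}} $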

Remember this picture for bias/variance:

There is a fundamental tradeoff between bias and variance that depends on how complex your model is. Very simple models (with what we've seen so far, models with few parameters) have high bias, since your true function is usually not constant, but have low variance, since they generally don't have the complexity to fit the noise of the specific dataset you got. Very complex models (high-degree polynomials) have low bias, since in expectation they can get a decent approximation of the true function, but have high variance, since they are able to fit the noise in the data.

This section has some code examples to demonstrate how this bias-variance tradeoff occurs with different model complexities using synthetic data.

Step 1 - Define our "true function" we're going to try to learn
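As a sketch of what such a cell might contain (this particular $f$ is made up; the function used in the original demo may differ):

```python
import numpy as np

# A made-up "true function" f for illustration.
def f(x):
    return np.sin(2 * np.pi * x)
```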

Step 2 - Define functions to simulate random training dataset

The generate_data function below will generate x values uniformly at random in [min_x, max_x] and then assign the y values using the function f plus Gaussian noise with mean 0 and standard deviation noise_sd.
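Here is a sketch of what generate_data might look like, given the description above (the original implementation may differ in its details):

```python
import numpy as np

# A possible generate_data, following the description in the text.
def generate_data(n, f, min_x=0.0, max_x=1.0, noise_sd=0.1, rng=None):
    rng = rng or np.random.default_rng()
    x = rng.uniform(min_x, max_x, size=n)        # inputs drawn uniformly at random
    y = f(x) + rng.normal(0, noise_sd, size=n)   # true function plus Gaussian noise
    return x, y

# Example usage (with the sketch f above): one possible random training set
x_train, y_train = generate_data(20, f, noise_sd=0.25)
```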

Below is an example dataset we might observe.

If you run it multiple times, you would most likely get different values. Exercise: Why?

This feels kind of unrealistic - we only have one training dataset! True. But let's say our training dataset is a set of tweets sent today, and we train a model to predict number of retweets. There's a lot of randomness in that sample, and if we sampled again tomorrow, it would likely lead to a (slightly) different model!

Step 3 - Learn model(s) based on (a set of) training datasets

The following function will learn some given number of models using random samples from our training set. Exercise: how are we getting those random samples?

This technique of approximating new datasets from the underlying distribution, using only our one dataset, is known as resampling.

We will use each random (re)sample of our training set to train a model
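A sketch of that process (fit_models is a hypothetical helper, building on the generate_data/f sketches above; the original notebook's function may differ):

```python
# Resample the one training set with replacement and fit one polynomial
# model per resample.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def fit_models(x, y, degree, n_models, rng=None):
    rng = rng or np.random.default_rng()
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(x), size=len(x))   # indices sampled with replacement
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(x[idx].reshape(-1, 1), y[idx])
        models.append(model)
    return models

models = fit_models(x_train, y_train, degree=3, n_models=50)
```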

These models are all different, even though they're estimating the same f!!!

Exercises

Step 4 - Make a bunch of predictions at various values of x

Now, we want to think about how different forms we could specify for our guess at f lead to different levels of bias and variance. To do that, we're going to train a bunch of models, and use them all to make predictions at a bunch of values of x. Let's call each model $f_{\hat{w}}(x)$, representing the fact that they are all approximations of f based on different parameters $\hat{w}$. We're then going to plot the range of the predictions across all models to get a sense of variance, and $\mathbb{E}_{\hat{w}}[f_{\hat{w}}(x)]$ to get a sense of the bias.
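A sketch of that idea, continuing the hypothetical helpers above:

```python
# Evaluate every fitted model on a grid of x values, then summarize the
# predictions at each x.
import numpy as np

x_grid = np.linspace(0, 1, 100).reshape(-1, 1)
preds = np.array([m.predict(x_grid) for m in models])   # shape: (n_models, 100)

mean_pred = preds.mean(axis=0)   # compare to f(x_grid) to get a sense of bias
spread = preds.std(axis=0)       # a large spread at an x means high variance there
```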

Step 5 - Run some experiments!

This is the main function to run our experiments. See the cell afterwards for how to call it.

The way to read each of the graphs is: the plotted range shows the spread of predictions across the models at each x (variance), and the average prediction shows how far we are, in expectation, from the true function (bias).

Exercises

Model evaluation without access to the truth

We've already seen some of this stuff in Lecture 4 with our introduction to why we need a train/test split. But there's more to understand here!

Like last time, we're going to run through some lecture notes to build out the intuition first, and then play in the notebook and notes next week (Tuesday).